feat: record model thoughts by dannykopping · Pull Request #203 · coder/aibridge

dannykopping · 2026-03-05T15:37:10Z

Required for coder/coder#22676
Closes #168

This PR adds recording of model "thoughts" (sometimes call "reasoning"). This is only available for /v1/messages and /v1/responses. We only track model thoughts in relation to tool use; all other thoughts are irrelevant in the same way that we don't store regular inference responses.

I've raised a PR downstack to allow parallel tool calls when there are no injected tools.

This leaves open the question of how we will associate model thoughts to these tools. Right now the two supported APIs do not "interleave" thoughts per tool use: all the thinking is done upfront which results in the multiple tool uses. Here are two examples:

/v1/messages

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":"The"}              }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" user is asking me to get"}       }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" stock prices for Apple and Google and compare"}           }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" their"}               }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" performance over the last week."}        }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":"\n\nLet me search"}    }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" for both"}      }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" stock prices in"}    }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":" parallel"} }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"thinking_delta","thinking":"."}       }

event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"signature_delta","signature":"..."}      }

event: content_block_stop
data: {"type":"content_block_stop","index":0}

event: content_block_start
data: {"type":"content_block_start","index":1,"content_block":{"type":"tool_use","id":"toolu_01KLHTi1EnM4RxxG9VfFBwDt","name":"WebSearch","input":{},"caller":{"type":"direct"}}     }

...

event: content_block_start
data: {"type":"content_block_start","index":2,"content_block":{"type":"tool_use","id":"toolu_01T9we4nEzyv7WVoiazyK8VE","name":"WebSearch","input":{},"caller":{"type":"direct"}}          }

/v1/responses

{
    "type": "response.completed",
    "response": {
        ...
        "output": [
            {
                "id": "rs_0a139fe840c9c4640169b296e3eb088195886c12786026df1a",
                "type": "reasoning",
                "encrypted_content": "...",
                "summary": []
            },
            {
                "id": "msg_0a139fe840c9c4640169b296ebad1081958b3077bf7ae243ee",
                "type": "message",
                "status": "completed",
                "content": [
                    {
                        "type": "output_text",
                        "annotations": [],
                        "logprobs": [],
                        "text": "I’m reading both paths directly so I can return their contents or tell you if either one is missing or not a regular file."
                    }
                ],
                "phase": "commentary",
                "role": "assistant"
            },
            {
                "id": "fc_0a139fe840c9c4640169b296ed6ebc8195b5cc1641dac78918",
                "type": "function_call",
                "status": "completed",
                "arguments": "{\"cmd\":\"sed -n '1,250p' /tmp/foo\",\"yield_time_ms\":1000,\"max_output_tokens\":6000,\"workdir\":\"/home/coder/coder\"}",
                "call_id": "call_JmiKwclDrwqU1w2Ojk0E9wQB",
                "name": "exec_command"
            },
            {
                "id": "fc_0a139fe840c9c4640169b296ed6ecc819586447dbe18e23fff",
                "type": "function_call",
                "status": "completed",
                "arguments": "{\"cmd\":\"sed -n '1,250p' /home/bar\",\"yield_time_ms\":1000,\"max_output_tokens\":6000,\"workdir\":\"/home/coder/coder\"}",
                "call_id": "call_juUonJYUHxtCJb6ckr9tNTv5",
                "name": "exec_command"
            }
        ],
        "parallel_tool_calls": true,
        ...
    }
}

dannykopping · 2026-03-05T15:37:28Z

feat: record model thoughts #203 👈 (View in Graphite)
feat: only disable parallel tool calls when injectable tools are present #211
main

This stack of pull requests is managed by Graphite. Learn more about stacking.

intercept/responses/base.go

pawbana · 2026-03-06T15:06:30Z

recorder/types.go

 	InvocationError error
 	Metadata        Metadata
 	CreatedAt       time.Time
+	ModelThoughts   []*ModelThoughtRecord


How much space could ModelThoughts take?
Maybe we should consider configuring recording things not only due to compliance reasons but also to limit space usage.

It all depends on how many thinking tokens are given to the model, and even then we're only capturing a summary, so I suspect it won't be much. In any case, having this information will help provide insight into why agents take the course of action they do, which may end up being valuable for an audit scenario.

In any case, disk is cheap.

We can later add a flag to not store these conditionally.

intercept/messages/blocking.go

pawbana · 2026-03-09T11:44:06Z

intercept/messages/blocking.go

-
+			// Clear after first use to avoid duplicating across
+			// multiple tool calls in the same message.
+			thoughtRecords = nil


Maybe it is a bit out of scope but this is a bit confusing to me and it may be a good time for a small cleanup.

If I understand correctly clearing thoughtRecords should not matter right now as parallel tool calls are disabled but it is added just in case they are enabled in the future? Or it is just to make sure that same thoughts are not stored in 2 tool calls? In later case I think it would be better to have duplicated information in 2 calls then 1 call with none.

At the same time construction of thoughtRecords is not prepared for parallel case, it just concatenates all thinking blocks from Content. How about constructing thoughtRecords slice in this for loop and clearing it on each RecordToolUsage call?

I see later thoughtRecords are used in range pendingToolCalls loop but I think those calls should also be processed in this loop. Then it should be possible to aggregate thinking blocks per call properly (assuming thinking blocks are properly ordered with tool calls dividing them) and would make code a bit simpler. Maybe tool call processing could even be extracted to base struct.

I think this is related to @pawbana comment, but just to clarify: IIUC we're storing all thinking records on the first tool call, and subsequent calls get none. Not sure how we're planning to present this in the UI, but all thinking would be associated with a single tool call, which might not be accurate.

Is the purpose of this mapping between tool calls and thinking to "understand why the model chose to use this tool"? If so, I like @pawbana suggestion of aggregating thinking blocks per call. Otherwise, I'm not sure this mapping of tool calls to thinking is the right approach 🤔

I've updated the description which should answer both of your questions 👍

We only track model thoughts in relation to tool use; all our thoughts are irrelevant in the same way that we don't store regular inference responses.

I don't like that some thoughts are irrelevant but all thoughts will be recorded with a first tool call regardless if thoughts are related to that call or not.
I see why trigger to record thoughts is existence of some tool call in response but I don't like connecting thoughts to tool call since there is no strong connection between them / some thoughts may be completely irrelevant.

Right now the two supported APIs do not "interleave" thoughts per tool use: all the thinking is done upfront which results in the multiple tool uses.

If thoughts are not interleaven between tool calls why there are recorded using RecordToolUsage? Because this is the only thing that is recorded per inner agentic loop iteration?

If my assumption about inner loop is correct I understand why this is done this way but recording all thoughts with first tool call feels like a hack. How about introducing new method RecordThoughts? In #203 there is already separate table for thoughts.

I fell like following structure better reflects reality:

interception has:

multiple loops, each loop has:

one thoughts record

can have multiple tool call records (when parallel tool calls are enabled).

For each loop we could match thoughts to tool calls by time (since each loop iteration should be quite slow) or we could add new field that would store on which iteration of inner loop stuff happened.

With deprecation of inner loop we could simplify this into interception having one thought record + multiple tool call records.

intercept/responses/base.go

intercept/messages/blocking.go

pawbana · 2026-03-09T12:26:39Z

Maybe I'm missing something but is there a reason thinking blocks are merged into RecordToolUsage and recorded only on tool call? (inner loop abstraction?)
I understand they are usually connected but I'd think there could be some reasoning without any tool call or provider could call tool directly.
Maybe thinking blocks should be part of RecordInterceptionEnded?

fixtures/anthropic/simple.txtar

fixtures/fixtures.go

ssncferreira · 2026-03-09T18:43:40Z

intercept/messages/blocking.go

-
+			// Clear after first use to avoid duplicating across
+			// multiple tool calls in the same message.
+			thoughtRecords = nil


I think this is related to @pawbana comment, but just to clarify: IIUC we're storing all thinking records on the first tool call, and subsequent calls get none. Not sure how we're planning to present this in the UI, but all thinking would be associated with a single tool call, which might not be accurate.

Is the purpose of this mapping between tool calls and thinking to "understand why the model chose to use this tool"? If so, I like @pawbana suggestion of aggregating thinking blocks per call. Otherwise, I'm not sure this mapping of tool calls to thinking is the right approach 🤔

ssncferreira · 2026-03-09T18:55:19Z

intercept/messages/streaming.go

+					case anthropic.ThinkingBlock:
+						thoughtRecords = append(thoughtRecords, &recorder.ModelThoughtRecord{
+							Content:   variant.Thinking,
+							CreatedAt: time.Now(),


AFAIK, with streaming, the code waits until we get a stop block and processes all thinking blocks at that point, meaning they'll all have the same CreatedAt. I assume the ordering is still preserved by their position in the slice, so this probably doesn't matter, but worth noting that CreatedAt won't reflect when each block actually arrived. Could this be an issue?

This is actually true of tool usages as well which are recorded at this stage, as well.

Postgres is microsecond-precise, so some records may be persisted with the same timestamp.

The timestamp for thoughts doesn't really matter, though; as long as it's associated to a tool call it'll be displayed correctly (i.e. before the tool call).

Signed-off-by: Danny Kopping <danny@coder.com>

fixtures/anthropic/simple.txtar

pawbana · 2026-03-13T10:15:41Z

intercept/messages/blocking.go

-
+			// Clear after first use to avoid duplicating across
+			// multiple tool calls in the same message.
+			thoughtRecords = nil


We only track model thoughts in relation to tool use; all our thoughts are irrelevant in the same way that we don't store regular inference responses.

I don't like that some thoughts are irrelevant but all thoughts will be recorded with a first tool call regardless if thoughts are related to that call or not.
I see why trigger to record thoughts is existence of some tool call in response but I don't like connecting thoughts to tool call since there is no strong connection between them / some thoughts may be completely irrelevant.

Right now the two supported APIs do not "interleave" thoughts per tool use: all the thinking is done upfront which results in the multiple tool uses.

If thoughts are not interleaven between tool calls why there are recorded using RecordToolUsage? Because this is the only thing that is recorded per inner agentic loop iteration?

If my assumption about inner loop is correct I understand why this is done this way but recording all thoughts with first tool call feels like a hack. How about introducing new method RecordThoughts? In #203 there is already separate table for thoughts.

I fell like following structure better reflects reality:

interception has:

multiple loops, each loop has:

one thoughts record

can have multiple tool call records (when parallel tool calls are enabled).

For each loop we could match thoughts to tool calls by time (since each loop iteration should be quite slow) or we could add new field that would store on which iteration of inner loop stuff happened.

With deprecation of inner loop we could simplify this into interception having one thought record + multiple tool call records.

intercept/messages/base.go

dannykopping mentioned this pull request Mar 5, 2026

feat: record and correlate tool call IDs across interceptions #188

Merged

dannykopping mentioned this pull request Mar 5, 2026

feat: add client session tracking #198

Merged

dannykopping force-pushed the dk/model-thoughts branch from 8682f8d to 56ff5d0 Compare March 6, 2026 07:28

dannykopping changed the base branch from dk/session-id-tracking to graphite-base/203 March 6, 2026 11:52

dannykopping force-pushed the graphite-base/203 branch from 1d5e6ac to e559e5e Compare March 6, 2026 11:52

dannykopping force-pushed the dk/model-thoughts branch from 56ff5d0 to 11d859c Compare March 6, 2026 11:52

graphite-app bot changed the base branch from graphite-base/203 to main March 6, 2026 11:53

dannykopping force-pushed the dk/model-thoughts branch from 11d859c to 855e3a7 Compare March 6, 2026 11:53

dannykopping commented Mar 6, 2026

View reviewed changes

intercept/responses/base.go Show resolved Hide resolved

dannykopping marked this pull request as ready for review March 6, 2026 14:30

dannykopping requested review from pawbana and ssncferreira March 6, 2026 14:30

dannykopping force-pushed the dk/model-thoughts branch from d09d5b0 to a0ad392 Compare March 6, 2026 15:14

pawbana reviewed Mar 9, 2026

View reviewed changes

intercept/messages/blocking.go Outdated Show resolved Hide resolved

ssncferreira reviewed Mar 9, 2026

View reviewed changes

dannykopping changed the base branch from main to graphite-base/203 March 11, 2026 08:27

dannykopping force-pushed the dk/model-thoughts branch from a0ad392 to c2f9c79 Compare March 11, 2026 08:27

dannykopping changed the base branch from graphite-base/203 to dk/parallel-tool-calls March 11, 2026 08:27

dannykopping mentioned this pull request Mar 11, 2026

feat: only disable parallel tool calls when injectable tools are present #211

Merged

dannykopping marked this pull request as draft March 11, 2026 08:28

dannykopping force-pushed the dk/parallel-tool-calls branch from a0e150e to b2e4f03 Compare March 11, 2026 12:15

dannykopping force-pushed the dk/model-thoughts branch 2 times, most recently from 8a13f42 to a492e77 Compare March 11, 2026 12:27

dannykopping force-pushed the dk/parallel-tool-calls branch from b2e4f03 to 732f3d2 Compare March 11, 2026 12:27

dannykopping force-pushed the dk/model-thoughts branch 3 times, most recently from 158d8b0 to 25b5478 Compare March 12, 2026 12:18

dannykopping requested review from pawbana and ssncferreira March 12, 2026 12:18

dannykopping mentioned this pull request Mar 12, 2026

feat: record model thoughts coder/coder#22676

Open

dannykopping marked this pull request as ready for review March 12, 2026 13:44

dannykopping force-pushed the dk/model-thoughts branch from 0773ea6 to 8db8a1c Compare March 12, 2026 13:45

dannykopping changed the base branch from dk/parallel-tool-calls to graphite-base/203 March 12, 2026 13:48

dannykopping added 10 commits March 12, 2026 13:49

feat: record model thoughts

f822439

Signed-off-by: Danny Kopping <danny@coder.com>

fix: send model thoughts with tool usage recording

45ca516

Signed-off-by: Danny Kopping <danny@coder.com>

feat: capture responses reasoning

73470b2

Signed-off-by: Danny Kopping <danny@coder.com>

chore: refactor tests

9caa501

Signed-off-by: Danny Kopping <danny@coder.com>

chore: test multiple thoughts

8e8e110

Signed-off-by: Danny Kopping <danny@coder.com>

chore: refactor tests

cc7431e

Signed-off-by: Danny Kopping <danny@coder.com>

feat: capture commentary as model thoughts

2df04fe

Signed-off-by: Danny Kopping <danny@coder.com>

chore: cleaning up /v1/messages impl

6988515

Signed-off-by: Danny Kopping <danny@coder.com>

chore: update comments

f4347de

Signed-off-by: Danny Kopping <danny@coder.com>

chore: add tests for parallel tool calls

81c22b9

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/model-thoughts branch from 8db8a1c to 7a0d206 Compare March 12, 2026 13:49

dannykopping force-pushed the graphite-base/203 branch from 3466ab6 to 4fd7ded Compare March 12, 2026 13:49

graphite-app bot changed the base branch from graphite-base/203 to main March 12, 2026 13:49

chore: capture source of thinking/reasoning

87044c1

Signed-off-by: Danny Kopping <danny@coder.com>

dannykopping force-pushed the dk/model-thoughts branch from 7a0d206 to 87044c1 Compare March 12, 2026 13:50

pawbana reviewed Mar 13, 2026

View reviewed changes

Conversation

dannykopping commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dannykopping commented Mar 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pawbana Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

pawbana commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pawbana Mar 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dannykopping commented Mar 5, 2026 •

edited

Loading

dannykopping commented Mar 5, 2026 •

edited

Loading

pawbana Mar 13, 2026 •

edited

Loading

pawbana commented Mar 9, 2026 •

edited

Loading

pawbana Mar 13, 2026 •

edited

Loading